Add minhash deduplicator based on RAY. #502

Open: 21 commits to be merged into base: main


@chenyushuo (Collaborator) commented Nov 28, 2024

Unlike #489, this PR merges equivalence classes with a multi-process union-find (disjoint-set) structure implemented on top of Ray actors.
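For readers unfamiliar with the technique, here is a minimal single-process sketch of union-find based equivalence class merging. It is not the code in this PR: the names `UnionFind` and `keep_representatives` are hypothetical, the input is assumed to be pairs of sample ids already flagged as near-duplicates (for example by MinHash/LSH), and the distribution of the structure across Ray actors is left out entirely.

```python
class UnionFind:
    """Disjoint-set over hashable, comparable ids (hypothetical helper)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen ids as their own singleton class.
        self.parent.setdefault(x, x)
        # Walk up to the root of x's class.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Assumption for this sketch: the smallest id represents the class.
        if ra < rb:
            self.parent[rb] = ra
        else:
            self.parent[ra] = rb


def keep_representatives(ids, duplicate_pairs):
    """Return the ids that survive deduplication, one per equivalence class."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in ids if uf.find(i) == i]


if __name__ == "__main__":
    # Samples 1, 2, 3 are mutual near-duplicates, as are 5 and 6.
    print(keep_representatives(range(1, 7), [(1, 2), (2, 3), (5, 6)]))
```

In a distributed setting the `parent` map would presumably be partitioned across several workers, with union requests routed between them; how this PR does that is best read from the diff itself.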

@yxdyc requested review from yxdyc, HYLcool and pan-x-c on Dec 11, 2024, 08:04
@yxdyc added the dj:op and dj:dist labels on Dec 11, 2024
@yxdyc added the dj:efficiency label on Dec 20, 2024
@chenyushuo changed the title from "[WIP] Add minhash deduplicator based on RAY." to "Add minhash deduplicator based on RAY." on Dec 20, 2024
Labels
dj:dist: issues/PRs about distributed data processing
dj:efficiency: efficiency issues and enhancements
dj:op: issues/PRs about some specific OPs
3 participants